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RNA-seq data can be mined for sequence differences relative to the reference genome to identify both genomic SNPs and 
RNA editing events. We analyzed the long, poIyA-selected, unstranded, deeply sequenced RNA-seq data from the 
ENCODE Project across 14 human cell lines for candidate RNA editing events. On average, 43% of the RNA sequencing 
variants that are not in dbSNP and are within gene boundaries are A-to-G(I) RNA editing candidates. The vast majority of 
A-to-G(I) edits are located in introns and 3' UTRs, with only 123 located in protein-coding sequence. In contrast, the 
majority of non-A-to-G variants (60%-80%) map near exon boundaries and have the characteristics of splice-mapping 
artifacts. After filtering out all candidates with evidence of private genomic variation using genome resequencing or ChlP- 
seq data, we find that up to 85% of the high-confidence RNA variants are A-to-G(I) editing candidates. Genes with A-to- 
G[I) edits are enriched in Gene Ontology terms involving cell division, viral defense, and translation. The distribution and 
character of the remaining non-A-to-G variants closely resemble known SNPs. We find no reproducible A-to-G[I) edits 
that result in nonsynonymous substitutions in all three Iymphoblastoid cell lines in our study, unlike RNA editing in the 
brain. Given that only a fraction of sites are reproducibly edited in multiple cell lines and that we find a stronger as- 
sociation of editing and specific genes suggests that the editing of the transcript is more important than the editing of any 
individual site. 



[Supplemental material is available for this article.] 

RNA editing is a post-transcriptional process that modifies the 
primary RNA and microRNA transcripts. This process can result in 
nonsynonymous protein coding substitutions, alternative splic- 
ing, nuclear retention of mRNA, or alterations of microRNA seed 
regions (for a review, see Nishikura 2010). The most common form 
of RNA editing in mammals is A-to-I editing, in which adenosine is 
deaminated to produce inosine by members of the ADAR (adeno- 
sine deaminases acting on RNA) family of enzymes (Wagner et al. 
1989; Kim et al. 1994; Kumar and Carmichael et al. 1997). C-to-U 
editing in mammals by the APOBEC family of enzymes is thought 
to be much less frequent and much more specific (Gerber and 
Keller 2001). In mammals, ADAR is found within several tissues, 
while AD ARB 1 is known to be active in the brain. Abnormal RNA 
editing has been reported in epilepsy, amyotrophic lateral scle- 
rosis (ALS), brain ischemia, depression, and brain tumors (Maas et al. 
2006; Nishikura 2006; Peng et al. 2006; Paz et al. 2007; Cenci et al. 
2008). 

ADARs recognize double-stranded RNA as their major sub- 
strate, but editing at some sites is very selective for specific A resi- 
dues, while other sites are edited promiscuously and mainly in 
clusters (Nishikura et al. 1991; Poison and Bass 1994). Because 
inosine pairs preferentially with cytidine, I is read as G during 
protein synthesis or during reverse transcription for RNA-seq. 
Known functional consequences of this post-transcriptional mod- 
ification include changes in amino acids in the protein product 
such as in the glutamate and serotonin receptors, creation or de- 
letion of entire exons by changes in splicing, retention of mRNA in 
the nucleus, changes in RNA stability, heterochromatin formation, 
protection against viral RNA, and microRNA modification (Zheng 
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et al. 1992; Burns et al. 1997; Seeburg et al. 1998; Athanasiadis et al. 
2004; Luciano et al. 2004; Prasanth et al. 2005; Wang et al. 2005; 
Agranat et al. 2008). Although the best-studied RNA editing sites 
are in coding sequences that qualitatively change the protein 
product, the majority of known RNA edits in human occur within 
Alu sequences embedded within introns and UTRs (Kim et al. 2004). 
ADAR mouse knockouts are embryonic lethal at day El 1.5 (Wang 
et al. 2004), where it plays an important role in suppressing in- 
terferon signaling to block premature apoptosis in hematopoiesis 
(Iizasa and Nishikura 2009). While microRNAs are known to be 
important RNA editing targets, this present study focuses on 
RNA editing in messenger RNA as measured from polyA-selected 
RNA. 

We surveyed our human polyA+ ENCODE RNA-seq data from 
14 cell types for RNA editing events using a rigorous computational 
pipeline designed to filter out sequencing and read mapping arti- 
facts. We further filtered private genomic single nucleotide vari- 
ants (SNVs) for 12 of the 14 cell types using either 1000 Genomes 
resequencing data (The 1000 Genomes Project Consortium 2010) 
or ENCODE ChlP-seq data sets. We identified between 500 and 
3000 reproducible A-to-I RNA editing events per cell type in bi- 
ological duplicate RNA-seq samples. We then focused on genes 
that are frequently edited across multiple cell types for further 
analysis, and found enrichment for genes involved in basic 
housekeeping processes such as cell division, viral defense, and 
translation. 

Results 

Development and refinement of an RNA-editing pipeline tuned 
based on data from the GM12878 Iymphoblastoid cell line 

RNA-seq data has been mined for known SNPs in expressed genes 
to study allele-specific expression (Montgomery et al. 2010). While 
sequence variants in RNA-seq that are not in the genome are RNA 
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editing candidates, we expect some level of mapping artifact and 
sequencing error in the data, and these could be mistaken for RNA 
editing. We reasoned that a pipeline that maximizes the fraction of 
called known SNPs would then produce the most conservative set 
of RNA editing candidates. One ENCODE cell line, GM12878, was 
particularly well-suited for tuning our pipeline, as it was deeply 
sequenced as part of the 1000 Genomes Project with the results 
incorporated into dbSNP (Sherry et al. 2001). GM12878 2 X 75 bp 
RNA-seq reads were mapped onto an expanded genome that in- 
cludes known splice junctions (Mortazavi et al. 2008) using Bowtie 
as described in the Methods (Figs. 1A,B; Supplemental Table 1). 
Reads that mapped onto splice junctions were set aside because 
they are more prone to mismapping artifacts, and the remainder 
were used for SNV calling. An additional source of false-positive 
SNVs are reads with errors that are amplified during the library 
construction process. We can avoid PCR artifacts by collapsing 
reads, i.e., counting reads with identical starts only once. We fur- 
ther restricted ourselves to sites that had a minimum 10% fre- 
quency in both the entire data set ("uncollapsed set") as well as in 
the smaller data set of nonredundant reads ("collapsed set"). We 
kept only candidate RNA editing sites called from two independent 
RNA-seq replicates (Fig. 1C). Of the SNVs present in the union of 
collapsed and uncollapsed sets, 47% were in the intersection, while 
37% were only in the uncollapsed set and 16% were only in the 
collapsed set (Fig. ID). We found that using the intersection 
strategy delivered a higher fraction of A-to-G calls (20%), while the 
sets that were only in the uncollapsed or only in the collapsed sets 
had A-to-G calls of 14% and 18%, respectively (Fig. IE). We then 
scored our SNVs (Supplemental Table 2) for occurrence within 
dbSNP and found that the intersection had the highest percentage 
of known SNPs (71%) (Fig. IF). 

We further filtered the set of SNVs absent in dbSNP (candi- 
dates for editing because they are not known polymorphisms) for 
their overlap with known transcript boundaries from GENCODE 
v7 protein coding genes, which retained 86% of the candidates. 
SNV calls within known transcripts on the minus strand were 
reassigned to reflect the appropriate substitution in the sense of 
transcription. We then used ANNOVAR (Wang et al. 2010) to an- 
notate the part of the transcript in which the variants occurred. We 
found that A-to-G variants were present primarily in introns and 
UTRs, whereas 82% of the non-A-to-G calls were annotated as 
"splicing/' which is defined here as intronic and within 5 bp of the 
splice junction (Fig. 2A). Inspection of these calls revealed that 
they were primarily due to mismapped reads that should have 
mapped across splice junctions (Fig. 2B). After removing all calls 
within 5 bp of splice sites, we found that >80% of our novel SNVs 
were A-to-G calls, which is 20-fold higher than the 4% of SNV calls 
passing all the same filters that are G-to-A calls (Fig. 2C). This en- 
richment is 20% higher in the intersection of the collapsed and 
uncollapsed sets than it is in either of the outersects (Supplemental 
Fig. 1). However, there remains a chance that there are private 
genomic SNVs that are not yet represented in dbSNP because of 
low-coverage or read mapping issues. We therefore used the ge- 
nomic reads for GM12878 (and GM12891 and GM12892 below) 
from the 1000 Genomes Project to remove transcriptome SNVs 
with one or more genomic reads also supporting that candidate 
variant (Supplemental Table 1). This further increased our A-to-G 
SNV to >85% of the remaining calls. If we assume that our G-to-A 
calls are false positives with respect to editing (whether they are 
true SNPs not found in dbSNP or are sequencing/mapping arti- 
facts), then our false-discovery rate (FDR) for A-to-G RNA editing 
would be <2%. No other SNV class accounted for >6% of our most 



stringently filtered set. T-to-C calls were the second most frequent 
class of SNV. Thity-three percent of these T-to-C were found in 
regions with overlapping GENCODE gene models on opposite 
strands, and another 33% had an unannotated transcript in the 
opposite sense as the overlapping gene model. 

We then tested whether using ChlP-seq reads could be used to 
filter candidate SNVs in cases for which genomic reads are not 
available, as is the case for many ENCODE cell lines analyzed in the 
next section. Surprisingly, we found that we filtered more SNVs 
using ChlP-seq than using the 1000 Genomes data. In order to 
better understand filtering the SNV calls using ChlP-seq data ver- 
sus 1000 Genomes data, we compared the coverage of SNVs in 
the two data-types. Although there were fewer total reads in the 
ENCODE ChlP-seq data than in the 1000 Genomes data, we found 
overall higher coverage in the ChIP data (mean coverage, 57.4) 
than in the 1000 Genomes data (mean coverage, 33.8). Our ChlP- 
seq mean coverage was even higher (60.6) over the 1457 SNVs that 
were filtered when using only the ChIP data, thus allowing us to 
detect rare variants (Supplemental Fig 2). However, there were 
other regions for which the 1000 Genomes data filter out SNVs 
that were missed using the ChIP data. 

SNVs that matched dbSNP have an expected bimodal distri- 
bution with one mode at a frequency of 1.0 and the second mode 
at a frequency of 0.5, which are due to homozygous and hetero- 
zygous SNPs, respectively (Fig. 2D). In contrast, we find that the 
distribution of A-to-G RNA editing calls are skewed rightward with 
a mode at a frequency of 0.2. Nonsplicing, non-A-to-G SNVs show 
a distribution similar to dbSNP SNVs, with a mode at a frequency 
1.0 (Fig. 2E). Relatively few candidate RNA editing events were in 
open reading frames (Supplemental Fig. 3). As expected, our power 
to call SNVs within exons exceeded that of calling them in introns 
for a given expression level (Fig. 2F), given the much greater depth 
of RNA-seq coverage in exons versus introns. We next asked 
whether genes with A-to-(G)I candidate edits had related func- 
tions, and found that they are enriched for Gene Ontology terms 
that are mainly broad anabolic functions such as translation, 
translational elongation, ribonucleoprotein complex, chromo- 
some, centromeric region, ribosome, cytosolic ribosome, mito- 
chondrial nucleoid, melanosome, and coated pit (Supplemental 
Table 3). 

Survey of 14 ENCODE cell lines 

We then applied the above pipeline to polyA RNA-seq from 14 
ENCODE cell types with 2 X 75 bp non-strand-specific protocol, all 
of which express only ADAR (Supplemental Fig. 4). Since se- 
quenced genomes were not available for any cell line other than 
the three lymphoblastoid cell lines, we substituted ENCODE ChlP- 
seq data from HudsonAlpha and histone modifications data from 
the Broad Institute to filter out private genomic SNVs that were not 
represented in dbSNP (Supplemental Table 1). The fraction of 
editing candidates that were of the A-to-G class ranged from 50%- 
85% (Fig. 3A; Supplemental Fig. 5). HepG2 and HUVEC cells had 
the lowest number of candidate RNA editing sites with about 500 
calls each. The filtering of private genomic SNVs using ChlP-seq 
data increased the percentage of A-to-G calls by 5%-20%, with the 
most SNVs filtered out for growth-transformed tumor cell types 
(Fig. 3B). The number of candidate sites called by our pipeline did 
not depend on the sequencing depth of the RNA-seq data set 
(Supplemental Fig. 6) over the range represented in our samples 
(28-135 million reads per replicate), but the amount of filtering did 
depend on the aggregate depth of coverage in the ChlP-seq data 
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Figure 1. RNA SNV calling strategy. (A) Flowchart of analysis: 75-bp paired-end RNA-seq reads were mapped onto an extended genome (genome + 
known splice junctions + spikes) using Bowtie. Reads mapping onto splice sites and spikes were set aside, and reads mapping onto hgl 9 were used to call 
single nucleotide variants (SNVs). A parallel set of analyses was done using a collapsed set of reads with unique coordinates, and the intersections of SNVs 
from the uncollapsed and collapsed treatments were obtained. Known SNPs annotated in dbSNPI 32, sites outside gene boundaries, and intronic sites 
within 5 bp of splice junctions were removed. For the GM trio, any candidate with evidence of a private genomic variation was also removed. (B) Example 
of candidate editing site. Purple arrows pointing to the left represent reads on the (-) strand, while blue arrows pointing to the right represent reads on the 
(+) strand. The blocks represent variants between the reference DNA and the RNA-seq. A SNV is kept when at least three nonidentical reads support the 
SNV, with a minimum SNV frequency of 1 0%, and at least one edit per strand. (C) Intersection strategy for two replicates. For cell types with two replicates, 
the SNVs remaining after collapsing were intersected between the replicates. (D) The number of SNVs remaining after collapsing for the prefiltered sites. 
Number of SNVs that are only in the uncollapsed set are in blue; the intersection, purple; and collapsed set, red. (f ) Collapsing increases the relative 
amount of A-to-G SNVs and also increases the relative number of transitions. Number of SNVs that are only in the uncollapsed set are in blue; the 
intersection, purple; and collapsed set, red. (F) The fraction of dbSNP is highest in the intersection of the full and collapsed sets. The relative amount of calls 
found in dbSNPI 32, novel genie SNVs, and other SNVs in the uncollapsed set are at the left; the collapsed set, right; and the intersection of the two, middle. 



sets. The individual non- A-to-G classes ranged between 0%-10% 
of the total, with T-to-C again being the second most prevalent 
modification. Based on the results in GM12878, it is likely that the 
bulk of the T-to-C edits are A-to-G(I) edits from transcripts on the 
opposite strand relative to their original annotation. Although 
members of the APOBEC family are detectably expressed in all cell 
lines (Supplemental Fig. 7), we observed a very modest proportion 
of C-to-T(U) editing events in HepG2, HUVEC, and HCT116. 
While some of these C-to-T(U) sites might prove to be true RNA 
edits, only a handful of these C-to-T(U) sites are located in AU- 
rich regions known to be associated with APOBEC editing. We 



therefore provisionally conclude that most C-to-T(U) candidates 
are false positives. 

The number of candidate SNVs within coding domains 
summed over all cell types are below 1000, and these are primarily 
nonsynonymous for the non-A-to-G SNVs (Fig. 3C). We find that 
94% (5349 of 5695) of the candidate A-to-G(I) calls are within 
known repeat families, and 98% (5247 of 5349) of these are inAlu's. 
Alu families with the most members also had the most edits (Sup- 
plemental Table 4). This is certainly an underestimate, as our con- 
servative mapping strategy would underreport hyper-edited regions 
with more than three simultaneous edits within our 75-bp reads. 
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Figure 2. RNA editing calls in GM1 2878. (A) Most non-A-to-G SNVsare near splicing boundaries. The distribution relative to gene boundaries of A-to-G 
SNVs (left) versus non-A-to-G SNVs (right). (B) Example of reads mapped incorrectly across a known splice junction. Overhanging RNA-seq reads are 
mapped incorrectly into the intron when the correct position is in the adjacent exon, even though the splice junction was provided to the read mapper. (C) 
Distribution of SNVs at different steps in the pipeline. Prefiltered SNVs defined by having at least three nonidentical reads support the SNV, with a min- 
imum SNV frequency of 1 0%, at least one edit per strand, and no more than one type of SNV for the same position in blue. SNVs annotated in dbSNPI 32 
are red, SNVs that are not in dbSNPI 32 and within gene boundaries are green, SNVs that are not in dbSNPI 32 and within gene boundaries without 
splicing sites are purple, SNVs that had no matching 1 000 Genome sequencing reads are in light blue, and SNVs passing ChIP filtering are in orange. (D) 
Frequency distribution of SNVs primarily reflects expression of homozygous and heterozygous SNPs. The SNVs that were found in dbSNPI 32 are in blue; 
the novel genie SNVs, red. (f ) Most nonsplice adjoining SNPs are A-to-G. The nonsplicing novel genie A-to-G calls in filtered calls are in blue; nonsplicing 
novel genie A-to-G calls, red; nonsplicing novel genie non-A-to-G, brown; nonsplicing novel genie non-A-to-G in filtered calls, purple; and splicing-only 
novel genie, light blue. (F) Distribution of gene expression versus coverage of exonic sites are in red and intronic sites are in blue for genie SNVs. SNVs in 
more lowly expressed genes are primarily on exons, due to our minimum depth of coverage requirements. 



We next focused on individual sites with evidence of editing 
in multiple cell types. Overall 33.5% (1905 out of 5695 possible) 
of individual A-to-G(I) candidate editing sites occur independently 
in two or more different cell types (Fig. 4A; for non-A-to-G classes, 



see Supplemental Fig 8). We also found that 24% (1386/5695) of 
our candidate A-to-G(I) calls intersect with the DARNED database 
of RNA editing in human (Kiran and Baranov 2010), which was 
generated from mining human ESTs. This low overlap is expected, 
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Figure 3. Survey of SNV calls across ENCODE cell lines. (A) Distribu- 
tion of nonsplicing novel genie SNVs for all data sets. (B) In every cell 
type, the percentage of A-to-G SNVs increase and the number of can- 
didate sites decrease (red) after filtering for private SNVs using ChlP-seq. 
GM1 2878 calls were filtered with 1 000 Genomes or ChlP-seq reads are 
labeled with G or C, respectively. (C) Relatively few non-A-to-G synon- 
ymous SNVs (purple), non-A-to-G nonsynonymous SNVs (green), A-to-G 
synonymous SNVs (red), A-to-G nonsynonymous SNVs (blue) are found 
in ORFs. 



since the ENCODE RNA-seq data analyzed here did not include 
neuronal tissues or cell lines. We found 28 genes that were edited in 
all of our cell types, and 20 genes of these (71%) are also in the 
DARNED set. We found that 47.4% (662 out of 1396 possible) of 
genes edited are called as edited in at least two or more different cell 
types (Fig. 4B). We therefore conclude that gene-level association 
with editing is more robust than the identity of an individual site 
edited. 

We then focused on the distribution of calls within gene 
models. For most genes, A-to-G(I) RNA editing candidates were 
either all in intronic regions or all in UTRs, although l%-4% of 
edited genes had both intronic and UTR sites, depending on the 
cell type (Fig. 4C). Because editing events are — overall — rare in the 
transcriptome and because editing often covers only a fraction of 
transcripts from a given gene, we wanted to probe our sensitivity 
for calling events in replicate samples. While most ENCODE data is 
in duplicates, we had an instance of data from four replicate de- 
terminations for Human HI ES. We compared calls from Human 



HI ES cell reps 1 and 2 with calls from reps 3 and 4 (Fig. 4D) and 
found that the number of edits per gene that we had called was 
quite noisy when there were only a few candidate sites that were 
called per gene. This is possibly due to the stochastic and pro- 
miscuous nature of ADAR's ability to hyper-edit dsRNA (Poison 
and Bass 1994) or due to the sensitivity of pipeline to coverage at 
the low end of the RNA expression spectrum. 

We compared the filtered A-to-G(I) SNV calls in GM12878 
to those in its parents, GM12891 and GM12892, to help assess 
the stability of RNA editing within a single cell-type, i.e., EBV- 
transformed lymphoblastoid cells. GM12891 and GM12892 had 
1885 (86%) and 843 (87.7%) candidate A-to-G(I) sites located in 
479 and 265 genes, respectively. GM12878 shared 490 sites (337 
genes) with GM12891 and 292 sites (218 genes) with GM12892. 
While 26.2% of the individual editing SNVs were found in at least 
two individuals of the trio, >49.6% of the genes were in common 
(Fig. 4E). Thus the gene-level association with editing is also more 
reproducible than the identity of individual sites edited within 
different cell lines of the same type. 

Overall, there were 248 out of 1396 genes that were edited in 
at least five out of the 10 distinct cell types after we applied the 
ChlP-seq filter (Fig. 4A). Our GO analysis of these genes showed 
enrichments that included interspecies interaction between or- 
ganisms, cell division, DNA metabolic process, positive regulation 
of defense response to virus by host, protein folding, ER-Golgi 
intermediate compartment, ribosome, and ribonucleoprotein com- 
plex (Supplemental Table 5). Two genes that are especially highly 
edited within multiple cell types are the inhibitor of apoptosis 
XIAP and the caspase-3 target DFFA, suggesting an explicit and 
direct link to the apoptosis phenotype of the mouse ADAR 
knockout. 

The overwhelming prevalence of A-to-G(I) RNA editing can- 
didates in our lymphoblastoid trio agrees with a recent publication 
(Peng et al. 2012) but differs qualitatively from a previously 
reported analysis of RNA editing in a nonoverlapping set of 
HapMap lymphoblastoid lines (Li et al. 2011). To make a more in- 
formed comparison, we applied our pipeline to the data sets from 
Li et al. (201 1) though we note that their data were single replicate 
RNA-seq measurements with shorter 50-bp reads that were more 
shallowly sequenced (40 million reads vs. our average of 100 mil- 
lion reads). We found that some of their individual samples pro- 
duced similar genie, nonsplicing A-to-G enrichments as in our 
lymphoblastoid lines, while others had an especially high per- 
centage of A-to-C and T-to-G classes (Supplemental Figs. 9, 10). To 
check whether these could be DNA sequencing artifacts from ear- 
lier and different (2008-2009) Illumina chemistry, we checked 
the output of our pipeline on an independent data set (ENCODE 
GM12878 DNase-seq) from that same earlier timeframe. We found 
relatively low numbers of non-dbSNP SNVs in the DNase-seq data 
and found that A-to-C and T-to-G classes were again comparatively 
enriched (Supplemental Fig. 11), suggesting that a similar sys- 
tematic bias exists in some of the Li et al. (2011) samples as in 
genomic DNA sequenced with that technology in that timeframe; 
these are not therefore attributable to RNA editing. Upon further 
inspection, we found that the particular sites most affected are 
embedded in G-rich regions (Supplemental Fig. 12). In addition, 
recent reports (Schrider et al. 201 1; Pickrell et al. 2012) attribute the 
majority of the non-A-to-G calls in Li et al. (2011) to sequences 
that are paralogs of the gene that were reported as edited, i.e., 
a mismapping error. Therefore we only report our ChlP-filtered 
A-to-G(I) RNA editing candidates for each cell line in Supplemental 
Tables 6 through 19. 
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Discussion 

We developed an intentionally conservative strategy to identify 
candidate RNA edits from RNA-seq data for the human ENCODE 
cell lines. We analyzed the transcriptomes of 14 cell types and 
compared the RNA sequences with the reference genome, filtered 
out known SNPs from dbSNP, and filtered out private SNPs de- 
tected in ChlP-seq data from each cell line. We found that SNVs in 
the intersection of our "collapsed" and "uncollapsed" mapping 
sets yielded the highest fraction of known SNPs; this observation 
functions as a positive control that our SNV calls are not signifi- 
cantly biased by mapping artifacts. We also found that incorrect 
read mapping across splice junctions was the source of the ma- 
jority of non- A-to-G calls. We further developed a strategy to fil- 
ter out nonediting transcriptome SNVs by using up to 2.0 billion 
ChlP-seq reads in GM12878 and 0.1-1.7 billion ChlP-seq reads 
in the other nine cell lines that do not have resequencing 
data available. Together with filtered calls in two additional 
lymphoblastoid with 1000 Genomes data, we therefore have reli- 
able A-to-G(I) editing candidates in 12 of our 14 cell lines (Sup- 
plemental Tables 6-19). 

Up to 87% of SNVs that are not SNPs (either in dbSNP or 
private genomic SNVs) are A-to-G calls; this suggests they are likely 
to be A-to-I editing candidates. Furthermore, >97% of these can- 
didates are located in introns and 3 ' UTRs, which is consistent with 
what was previously known about RNA editing based on earlier 
EST surveys and recent reports (Bahn et al. 2012; Peng et al. 2012). 



Our candidate A-to-G(I) RNA editing sites have a different variant 
frequency from known SNPs. They tend to cluster predominantly 
in the 3' UTR or in introns. In the three cell lines with the least 
amount of A-to-G (I) editing, there were relatively more C-to-T(U) 
SNVs, but these were not associated with AU-rich regions, as would 
have been expected if these are due to APOBEC activity. We also 
found that individual RNA editing calls are noisy for lowly ex- 
pressed genes because of depth of coverage requirements for edit- 
ing calls. If sensitivity to a conservative threshold was the major 
source of noise from one data set to another, and one replicate 
to another, then the identity of genes that are edited would be 
more consistent than the actual edits as long as edits occur in 
clusters. 

Overall, we report 5695 unique candidate A-to-G(I) RNA 
editing events in 1396 genes, including a subset of 248 genes that 
were consistently edited across more than five cell types. Ninety- 
nine percent of the candidate RNA editing calls occurred within 
known repeats (with 98% of those in Alu elements), and only 24% 
were annotated in DARNED, which is expected since none of the 
ENCODE cell lines in this study cover neuronal phenotypes. 
Comparing our results to previously reported widespread evidence 
of noncanonical RNA editing besides A-to-G(I) in a different set of 
HapMap lymphoblastoid cell lines (Li et al. 2011), we found that 
sequencing and mapping artifacts could account for the vast ma- 
jority of the unconventional (non-ADAR) variant calls. Moreover, 
we showed that this is caused, in part, by RNA data generated using 
older sequencing chemistry. 
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Genes with at least one edit within all cell types show GO 
term enrichments in "housekeeping" annotations related to cell 
division, DNA metabolic process, protein folding, and ribosome, as 
well as terms relating to viruses and defenses against them. The 
latter functions are consistent with the view that editing arose first 
in this context. For example, the protein EIF2AK2 inhibits protein 
synthesis upon activation by viral RNA and has an average of 13 
edits across all cell-types. This meshes well with reports that RNA 
editing participates in host-viral interactions, such as the editing of 
the hepatitis C virus (HCV) genome by ADAR (Taylor et al. 2005). 
Interestingly, some viruses have been able to take advantage of 
ADAR for their own purposes; endogenous ADAR has been shown 
to stimulate HIV-1 replication (Doria et al. 2009). 

When searching for RNA editing events to create a global map 
of high-quality candidates, there is a difficult tradeoff between 
sensitivity (identifying a highly inclusive set of possible edits) and 
specificity (being more confident that a call is in fact a true RNA 
edit). We judged it better to have a smaller number of candidate 
RNA editing events that are highly likely to be true than to have 
a larger number with an increased percentage of false positives. We 
undoubtedly lost a substantial number of true, low-level A-to-I 
RNA editing events in the process, and users of these RNA-Seq data 
might for some purposes want to build a more inclusive and less 
certain list. Another caveat that applies to our pipeline is that our 
method allows for a maximum of three edits per 75 bp within our 
reads. Thus, if there was a 7 5 -bp window that had significantly 
more than three edits, our pipeline would only detect the edits on 
the periphery of this hyper-edited domain. We could allow more 
mismatches during the alignment step, but then this would ex- 
acerbate the problem of incorrectly mapped reads. To date, the 
RNA-seq field has emphasized aggressive read mapping to maxi- 
mize sensitivity and new candidate discovery. This inevitably 
comes at the expense of specificity, and this effect is greater in 
complex genomes with extensive paralogous gene and repeat 
families such as those of mammals. Our current view is that 
there is no single correct threshold for the sensitivity specificity 
tradeoff: It has to be selected to match the objective of a given 
study and the future use of each analysis. When an analysis focuses 
on the small portion of sequence reads that imply differences from 
the majority that match well to the appropriate genome/tran- 
scriptome models, a general trend toward greater conservatism 
seems justified. We show here that an early sequencing chemistry 
issue and a problem with accurate read mapping over splice 
junctions are two specific contributors to false-positive RNA edit- 
ing candidates. We have greatly reduced these in our pipeline and 
present a conservative set for the ENCODE cell lines. In doing so, 
we demonstrated that there is no persuasive evidence in favor of 
noncanonical editing. A-to-G(I) edits strongly dominated our data 
except for three cell lines with modest evidence of C-to-T(U) SNVs, 
and no unknown edit was above the noise level. 

It is a sobering caution that, even when reads were mapped 
simultaneously against the genome and known splice junctions, 
we still had obvious splice-mapping artifacts. We were therefore 
forced to exclude all calls that mapped within a few bases of splice 
junctions; this, in turn, dramatically reduced non-A-to-G(I) calls. 
There remain a few noncanonical calls that pass our criteria, but we 
expect these to be primarily undetected private SNVs, regions with 
complicated mapping issues due to paralogous genes and pseu- 
dogenes in the genome, or uncharacterized splice junctions. This 
does not preclude the possibility of some very specific APOBEC- 
like, noncanonical editing of bases but calls into question their 
previously reported widespread occurrence (Li et al. 2011), which 



has been recently called into question (Kleinman and Majewski 
2012; Lin et al. 2012; Pickerell et al. 2012). 

While GM12878 had been "deeply" resequenced by 2010 
standards, we find that filtering with ChlP-seq was actually more 
effective. While this may seem paradoxical at first, ChlP-seq signal 
as well as background reads are most likely to come from open 
chromatin within transcribed genes. Although the 1000 Genomes 
project has greater coverage throughout the genome, we had in 
fact greater overall coverage of the candidate RNA SNVs in the 
pooled ChlP-seq data. While the amount of ChlP-seq data in 
GM12878 and other ENCODE Tier 1 and Tier 2 cell lines is ex- 
ceptional, they highlight the issue that really high coverage is 
necessary to detect a subset of SNVs. The fact that private genomic 
SNVs need to be accounted for and filtered out for assaying RNA 
editing suggests that we can use RNA-seq to identify SNVs in 
expressed genes. In effect, the depth of coverage in RNA-seq over 
medium to highly expressed genes achieves many of the same 
benefits of whole-exome sequencing in rare variant discovery. In 
the future, deconvolving rare genomic variants that are detectable 
by RNA-seq from true RNA editing events will be done more op- 
timally by simultaneous analysis of the raw reads from RNA-seq 
and genome-resequencing events at even higher coverage than 
readily available today. 

Methods 

Reads for each biological replicate data set were mapped to an 
expanded genome consisting of the human reference (GRCv3 7 / 
UCSC hgl9) plus GENCODE v7 splice junctions and added spike 
sequences using Bowtie, version 0.12.7 (Langmead et al. 2009), 
with at most three mismatches; reporting in SAM format up to one 
valid alignment per read; suppressing all alignments for a particu- 
lar read if more than one reportable alignment exist for it; and 
using only those alignments that fell into the best stratum. We 
used Bowtie to map reads instead of TopHat because the SNVs that 
were called using TopHat did not have a significant difference in 
SNV distribution when using RNA-seq reads or DNase-seq reads. 
Read ends were mapped separately and pooled afterward, without 
taking into account pairing information. The resulting SAM files 
were then stripped of spliced reads; converted to BAM files using 
samtools, version 0.1.17; sorted; and indexed; variants were called 
using the pileup command. We called an SNV when at least three 
nonidentical reads support a nonreference variant, and the variant 
is present at a minimum frequency of 10% and is supported by at 
least one read per strand. We discarded sites with more than one 
type of SNV call at the same location. In addition to the SNV calls 
in each full data set, a parallel set of analyses was done with po- 
tentially duplicated reads removed using the rmdup option of 
samtools to create the collapsed set. 

The intersection of SNV calls in biological replicates from the 
full BAM file and the intersection of SNV calls in the collapsed 
BAM files were intersected to create a list of candidate SNVs. 
Known SNPs from dbSNP132 that were not annotated as based 
on cDNA and sites lying outside the "comprehensive set" of 
GENCODE v7 protein coding gene boundaries were set aside, and 
the remaining novel SNVs within genie regions were corrected for 
strand sense. These novel genie candidates were then annotated 
using ANNOVAR (Wang et al. 2010) with a splicing threshold of 5. 
Gene Ontology analysis was performed using GREAT, version 1.8.2 
(McLean et al. 2010). Sites that were annotated by ANNOVAR as 
splicing were filtered out. To filter out genomic SNVs, genomic 
alignments were obtained from the 1000 Genomes project for 
the CEU GM trio, and ChlP-seq alignments were obtained from 
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ENCODE ChlP-seq data from HudsonAlpha and histone modifi- 
cation data from the Broad Institute. Samtools mpileup was used to 
look at the nucleotide composition over the SNVs, and sites with 
any evidence of a genomic SNV were filtered out. Hierarchical 
clustering for editing frequency of individual sites and number of 
edits for genes was done using Cluster 3.0 (de Hoon et al. 2004), 
with the centroid linkage hierarchical clustering option of both 
sites/genes and cell types. The heatmap was viewed using TreeView- 
1.1.5 (Page 2002). 

Data access 

All RNA-seq data are publicly available from the ENCODE re- 
pository at the UCSC Genome Browser (http://genome.ucsc.edu/ 
ENCODE/). Details of accession numbers can be found in Sup- 
plemental Table 20. 
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